---
title: Benford's Law Application and Interpretation
date: 2024-02-10
categories:
  - Pandas
  - Accounting
---
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
Benford's Law, also known as the Newcomb-Benford law or the first-digit law, is a surprising observation about the leading digits of numbers in real-world datasets. In many naturally occurring collections of data, smaller leading digits (like 1 and 2) are significantly more common than larger ones (like 8 and 9).
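The effect arises because real-world data often spans several orders of magnitude and is shaped by growth and multiplication. On a logarithmic scale, numbers with a leading digit of 1 occupy about 30% of each decade, while numbers with a leading digit of 9 occupy less than 5%. Data spread evenly across scales (a property known as "scale invariance") therefore favors smaller leading digits.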
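The link between multiplicative growth and small leading digits can be checked with a small simulation. The sketch below is illustrative only: the number of series, the number of steps, and the range of the random growth factors are arbitrary assumptions, not values taken from the P-Card data. It multiplies random factors together and compares the resulting leading-digit shares with Benford's proportions.

```python
import math
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is repeatable

N_SERIES = 10_000  # number of simulated quantities (arbitrary assumption)
N_STEPS = 80       # multiplicative shocks applied to each one (arbitrary assumption)

counts = Counter()
for _ in range(N_SERIES):
    # Track log10 of the value: multiplying factors adds their logarithms.
    log_value = sum(math.log10(random.uniform(0.5, 2.0)) for _ in range(N_STEPS))
    # The leading digit depends only on the fractional part of log10(value).
    mantissa = 10 ** (log_value % 1)      # lies in [1, 10)
    counts[min(int(mantissa), 9)] += 1    # min() guards a floating-point edge case

simulated = {d: counts[d] / N_SERIES for d in range(1, 10)}
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}: simulated={simulated[d]:.3f}  benford={benford[d]:.3f}")
```

Because the values end up spread over many orders of magnitude, the fractional part of their base-10 logarithm is close to uniform, which is what pushes the simulated shares toward Benford's.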
Benford's Law can be a quick and powerful tool for detecting anomalies or fraud in data. If a dataset is supposed to reflect real-world activity but deviates significantly from Benford's Law, it might contain manipulated or fabricated numbers.
df = pd.read_csv('DC_PCard_Transactions.csv')
df.head(3)
df.info()
The code below keeps amounts of at least 1, converts the 'TRANSACTION_AMOUNT' column to a string type, and takes the first character of each value as its leading digit.
# keep only amounts of at least 1, which drops negative amounts and amounts with a leading zero
# retrieve the first digit and use value_counts to find its frequency
df_benford = df['TRANSACTION_AMOUNT'] \
    [df['TRANSACTION_AMOUNT'] >= 1] \
    .astype(str).str[0] \
    .value_counts() \
    .to_frame(name="count") \
    .reset_index(names="first_digit") \
    .sort_values('first_digit')
# calculate percentages
df_benford['actual_proportion'] = df_benford['count'] / df_benford['count'].sum()
df_benford
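Taking the first character of the string works here because amounts below 1 are filtered out first. If amounts between 0 and 1 also needed to be counted, one possible alternative is to read the first significant digit from a decimal representation. The helper name `leading_digit` below is hypothetical and not part of the original analysis; this is a minimal sketch rather than a replacement for the code above.

```python
from decimal import Decimal


def leading_digit(amount) -> int:
    """Return the first significant digit of a positive, finite number.

    Hypothetical helper for illustration only.
    """
    value = Decimal(str(amount))
    if not value.is_finite() or value <= 0:
        raise ValueError("amount must be positive and finite")
    # Decimal stores the coefficient without leading zeros,
    # so its first digit is the first significant digit.
    return value.as_tuple().digits[0]


print(leading_digit(1234.5))  # first significant digit of 1234.5
print(leading_digit(0.042))   # leading zeros are skipped
```

Under these assumptions it could be applied to a filtered column with something like `df['TRANSACTION_AMOUNT'].map(leading_digit)`, at the cost of being slower than the vectorized string slice.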
Benford's proposed distribution of leading digit frequencies is given by
\begin{equation} P_i=\log _{10}\left(\frac{i+1}{i}\right) ; \quad i \in\{1,2,3, \ldots, 9\}, \end{equation}
where $P_i$ is the probability of finding $i$ as the leading digit in a given number.
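As a quick check of the formula, the first and last probabilities and the total over all nine digits work out as follows:

\begin{equation} \begin{aligned} P_1 &= \log_{10}\left(\frac{2}{1}\right) \approx 0.301, \qquad P_9 = \log_{10}\left(\frac{10}{9}\right) \approx 0.046, \\ \sum_{i=1}^{9} P_i &= \log_{10}\left(\prod_{i=1}^{9} \frac{i+1}{i}\right) = \log_{10}\left(\frac{10}{1}\right) = 1 . \end{aligned} \end{equation}

The product telescopes to $10/1$, so the nine probabilities sum to one and form a valid distribution.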
Create a new column that contains Benford's proposed distribution of leading digit frequencies.
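# append a benford_proportion column with Benford's expected proportion for each observed digit
# computing it from first_digit keeps the values aligned even if a digit is missing from the data
df_benford['benford_proportion'] = np.log10(1 + 1 / df_benford['first_digit'].astype(int))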
df_benford
fig = px.bar(
    data_frame=df_benford,
    x='first_digit',
    y=['actual_proportion', 'benford_proportion'],
    title='<b>Proportions of Leading Digits for P-Card Transactions</b><br>\
<span style="color: #aaa">Compared with Benford\'s Proposed Proportions</span>',
    labels={
        'first_digit': 'First Digit',
    },
    height=500,
    barmode='group',
    template='simple_white'
)
fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
    yaxis_title_text='Proportion',
    yaxis_tickformat=',.0%',
    legend_title=None,
)
fig.data[0].name = 'Actual'
fig.data[1].name = 'Benford'
fig.show()
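A chart shows the shape of the deviation, but not whether it is large enough to matter. One common way to put a number on it is a chi-square goodness-of-fit statistic, alongside the mean absolute deviation (MAD) between the observed and expected proportions. The sketch below is an assumption-laden illustration: the function name `benford_deviation` and the sample counts are made up, and the 5% critical value of 15.507 applies to 8 degrees of freedom (nine digits minus one).

```python
import math

CHI_SQUARE_CRITICAL_5PCT = 15.507  # chi-square critical value, 8 degrees of freedom


def benford_deviation(counts):
    """Return (chi_square, mad) for a dict mapping digits 1..9 to observed counts.

    Hypothetical helper for illustration only.
    """
    total = sum(counts.get(d, 0) for d in range(1, 10))
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    chi_square = 0.0
    abs_deviation = 0.0
    for d in range(1, 10):
        expected_p = math.log10(1 + 1 / d)
        observed = counts.get(d, 0)
        expected = total * expected_p
        chi_square += (observed - expected) ** 2 / expected
        abs_deviation += abs(observed / total - expected_p)
    return chi_square, abs_deviation / 9


# Made-up counts for illustration; with the data above they could be built as
# dict(zip(df_benford['first_digit'].astype(int), df_benford['count'])).
sample_counts = {1: 301, 2: 176, 3: 125, 4: 97, 5: 79, 6: 67, 7: 58, 8: 51, 9: 46}
chi_square, mad = benford_deviation(sample_counts)
print(f"chi-square={chi_square:.3f}, MAD={mad:.4f}")
print("deviates at the 5% level:", chi_square > CHI_SQUARE_CRITICAL_5PCT)
```

With very large samples the chi-square statistic tends to flag even tiny departures, so the MAD, which does not grow with sample size, is a useful second opinion. Either way, a large deviation is a reason to look closer, not proof of manipulation.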